Using Policies to Restrict Tools
Understand how to enforce a scope of work (SOW) by applying policies to workflows.
Rate limiting is great for preventing a bad tool run from wiping out a service when all items of work are equal. But not all items of work are equal: some machines and services are more important or more fragile than others (such as our service's database systems). In addition, machines or services may need to be put into logical groupings in which only a limited amount of work can happen at a time. These groupings could be broken up by site, geographical area, and so on.
This logic is generally specific to some set of work items. This bundling of work items and the rules around them, which we will call a scope of work (SOW), can be quite complex.
To safely do work, we must understand our scope. This might be how we can safely update database schemas for a particular service or how many route reflectors in a network region can be modified at a time.
To implement safety around a SOW, we will introduce the idea of policies. Policies will be used to check a set of work that is entering into the system for compliance. If it is not compliant, it will be rejected.
As an example, we will look at handling disk erasures similar to Google's disk-erase case study. Here are some protections we will add:
Only allow a single satellite disk erasure to happen every hour.
Rate limit so that we can only erase five machines at a time.
Pause for 1 minute after each five-machine erasure.
To be able to make a policy engine, we must have a common way to define what kind of work will be executed, in what order, and with what concurrency.
We also want the tool engineers to only define the work to be done and submit it to a separate service that executes it. This allows for the centralization of control.
Let's define the service that could do that in gRPC.
Defining a gRPC workflow service
Let's look at the arguments to make calls to see what our clients will send the workflow service, as follows:
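The original protocol buffer listing is not included in this excerpt. The following sketch shows one way the messages could look, consistent with the description that follows; the field names and numbers are assumptions, not the actual definition:

```protobuf
// Hypothetical sketch of the workflow messages; names and field
// numbers are assumptions based on the surrounding description.
message WorkReq {
  // The name of the workflow this request represents.
  string name = 1;
  // The Blocks that make up the workflow, executed one at a time.
  repeated Block blocks = 2;
}

message WorkResp {
  // A UUIDv1 ID referring to this WorkReq.
  string id = 1;
}

message Block {
  // The Jobs that make up this body of work.
  repeated Job jobs = 1;
  // How many Jobs may execute at the same time.
  int32 concurrency = 2;
}

message Job {
  // Which Job type on the server to call.
  string name = 1;
  // Arguments passed to that Job type.
  map<string, string> args = 2;
}
```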
These messages are used to define the work that a client wants the server to execute and contain the following attributes:
- The WorkReq message contains the name of the work and all the Block messages that make up a workflow.
- The Block message describes one body of work in the workflow. Each Block executes one at a time, holds a set of Job messages that describe the work to be done, and states the concurrency at which to execute the work described by those Job messages.
- The Job message describes which Job type on the server to call and with which arguments.
- The WorkResp message returns the ID that refers to this WorkReq. It uses UUIDv1 IDs, which encapsulate the creation time in the ID, so we know when the request was submitted to the system. That time is also used to prevent execution if the Exec() RPC is not called before some expiration time.
Exec messages provide the ID we want to execute, as illustrated here:
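The execution messages are also not shown in this excerpt; a minimal sketch, under the same naming assumptions as above, might be:

```protobuf
// Hypothetical sketch; the real definition may differ.
message ExecReq {
  // The UUIDv1 ID returned in WorkResp for the WorkReq to execute.
  string id = 1;
}

message ExecResp {}
```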
The complete protocol buffer definition also contains more messages and enums, not shown here, that allow for a Status call.
Now that we have messages to describe the work to be done, let's look at creating a policy engine.
Creating a policy engine
A policy checks our work to make sure some parameter is allowed. In our case, these parameters are inside a pb.WorkReq instance. We want policies to be generic so that they can be reused against multiple types of work described by a pb.WorkReq. Once defined, we will have a policy.json file that defines which policies are applied against a specifically named pb.WorkReq.
To make this work, each policy will need to receive the settings for the policy that should be applied to a specific workflow. Let's define two interfaces that describe a policy and its settings, as follows:
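The interface listing is not shown in this excerpt. The sketch below is one plausible shape for the two interfaces; the exact signatures are assumptions, and `pb.WorkReq` is replaced by a minimal stand-in struct so the example is self-contained:

```go
package main

import (
	"context"
	"fmt"
)

// WorkReq is a minimal stand-in for the generated pb.WorkReq type.
type WorkReq struct {
	Name   string
	Blocks []string // simplified; the real type holds Block messages
}

// Settings holds the configuration for one policy. Each Policy defines
// its own concrete Settings struct; Validate reports whether its fields
// are set to valid values.
type Settings interface {
	Validate() error
}

// Policy checks a WorkReq against the provided Settings and returns an
// error if the request does not comply.
type Policy interface {
	Run(ctx context.Context, req *WorkReq, settings Settings) error
}

// noopSettings and noopPolicy are trivial implementations that show the shape.
type noopSettings struct{}

func (noopSettings) Validate() error { return nil }

type noopPolicy struct{}

func (noopPolicy) Run(ctx context.Context, req *WorkReq, settings Settings) error {
	if req == nil {
		return fmt.Errorf("nil WorkReq")
	}
	return nil
}

// runPolicy is a helper so the example can be exercised without
// constructing a Context at the call site.
func runPolicy(p Policy, req *WorkReq, s Settings) error {
	return p.Run(context.Background(), req, s)
}
```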
Settings will always be implemented as some struct. Its Validate() method will be used to validate that the fields for that struct are set to valid values.
Policy runs our implementation against a pb.WorkReq with the settings provided.
Each WorkReq that is submitted will have a list of policies to apply. This is defined as follows:
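The listing for this type is not shown here; a plausible sketch, repeating the Settings interface from earlier so the snippet stands alone, is:

```go
package main

// Settings mirrors the policy settings interface described earlier.
type Settings interface {
	Validate() error
}

// PolicyArgs names a registered policy and carries the settings for one
// invocation of that policy against a WorkReq.
type PolicyArgs struct {
	// Name is the name the policy was registered under.
	Name string
	// Settings is the configuration for this invocation.
	Settings Settings
}
```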
Name is the name of the policy to invoke. Settings are the settings for that invocation.
The configuration file will detail a set of PolicyArgs arguments to run. Each policy also needs to be registered with the system; we will skip the details of the registration method here.
When a pb.WorkReq enters the system, we want to invoke those policies concurrently against that pb.WorkReq. Let's have a look at how that would work here:
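The engine's code is not reproduced in this excerpt. The following is a self-contained sketch of the same idea, using stand-in types instead of the generated pb types and reflect.DeepEqual instead of proto comparison; names and details are assumptions:

```go
package main

import (
	"context"
	"fmt"
	"reflect"
	"sync"
)

// WorkReq is a minimal stand-in for pb.WorkReq.
type WorkReq struct {
	Name string
	Jobs []string
}

// clone makes a deep copy; the real code would use proto.Clone.
func (w *WorkReq) clone() *WorkReq {
	return &WorkReq{Name: w.Name, Jobs: append([]string(nil), w.Jobs...)}
}

type Settings interface{ Validate() error }

type Policy interface {
	Run(ctx context.Context, req *WorkReq, settings Settings) error
}

type PolicyArgs struct {
	Name     string
	Settings Settings
}

// policies is the registry of named policies.
var policies = map[string]Policy{}

// Run applies the named policies to req concurrently, rejecting the
// request if a policy fails, is unknown, or mutated the request.
func Run(ctx context.Context, req *WorkReq, args []PolicyArgs) error {
	// If the configuration has no policies, there is nothing to check.
	if len(args) == 0 {
		return nil
	}
	// A cancelable Context lets us stop policies still running once one fails.
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	// Policies receive a copy so they cannot change the request.
	cloned := req.clone()

	// Make sure each named policy actually exists before running any.
	for _, arg := range args {
		if _, ok := policies[arg.Name]; !ok {
			return fmt.Errorf("policy %q not found", arg.Name)
		}
	}

	var (
		mu       sync.Mutex
		firstErr error
		wg       sync.WaitGroup
	)
	// Run all policies concurrently with the settings we were given.
	for _, arg := range args {
		arg := arg
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := policies[arg.Name].Run(ctx, cloned, arg.Settings); err != nil {
				mu.Lock()
				if firstErr == nil {
					firstErr = err
					cancel() // Cancel all running policies on the first error.
				}
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	if firstErr != nil {
		return firstErr
	}

	// Make sure the copy is still the same as what was submitted; the
	// request we validated must be the request that executes.
	if !reflect.DeepEqual(req, cloned) {
		return fmt.Errorf("a policy modified the WorkReq")
	}
	return nil
}

// denyName is a sample policy that rejects a WorkReq with a given name.
type denyName struct{ name string }

func (d denyName) Run(ctx context.Context, req *WorkReq, settings Settings) error {
	if req.Name == d.name {
		return fmt.Errorf("work named %q is denied", d.name)
	}
	return nil
}

// runPolicies is a helper so the example runs without building a Context.
func runPolicies(req *WorkReq, args []PolicyArgs) error {
	return Run(context.Background(), req, args)
}
```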
This code does the following:

- If the configuration for a pb.WorkReq has no policies, it returns immediately.
- It creates a Context object so that we can cancel running policies when one of them returns an error.
- It clones our pb.WorkReq so that a Policy cannot change it.
- It makes sure each Policy that is named actually exists.
- It runs all our policies concurrently with the settings that we were given.
- If any policy returns an error, it records the error and cancels all running policies.
- Finally, it makes sure the copy of the pb.WorkReq is the same as what was submitted.
We now have the main parts of a policy engine. The full engine also includes a Reader type that reads the policy.json file in which our policies are defined.
Let's look at writing a policy to be used by our engine.
Writing a policy
One of the most basic policies that we can define against a workflow is to limit which job types are allowed in that workflow.
This prevents some new type of work from being introduced into a workflow where no one has thought about policies that need to be applied to that Job.
For our first Policy implementation, let's write one that checks our pb.WorkReq to allow only Job types we have defined in our policy configuration. If we receive an unexpected Job, we reject the pb.WorkReq.
Let's define the settings for our Policy, as follows:
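The settings listing is not shown in this excerpt. The sketch below captures the same structure; the jobs package is replaced by a local registry map so the snippet stands alone, and all names are assumptions:

```go
package main

import "fmt"

// jobRegistry stands in for our jobs package, which knows every Job
// type the server can execute.
var jobRegistry = map[string]bool{
	"validateDecom":      true,
	"diskErase":          true,
	"sleep":              true,
	"getTokenFromBucket": true,
}

// Settings for the restrictJobTypes policy.
type Settings struct {
	// AllowedJobs are the names of the Job types this workflow may use.
	AllowedJobs []string
}

// Validate checks that every listed Job type actually exists.
func (s Settings) Validate() error {
	if len(s.AllowedJobs) == 0 {
		return fmt.Errorf("AllowedJobs must not be empty")
	}
	for _, name := range s.AllowedJobs {
		if !jobRegistry[name] {
			return fmt.Errorf("job type %q does not exist", name)
		}
	}
	return nil
}

// allowed reports whether a Job name is in the allowed list.
func (s Settings) allowed(name string) bool {
	for _, n := range s.AllowedJobs {
		if n == name {
			return true
		}
	}
	return false
}
```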
This code contains the following:

- Our specific Settings type, which implements policy.Settings.
- AllowedJobs, which lists the names of the jobs we allow.
- A Validate() method that uses our jobs package to check that the listed Jobs actually exist.
- An allowed() method that checks a given name against what we allow.
With these settings, a user can define a policy for any workflow in our configuration file that defines which Job types are allowed.
Let's define a type that implements the Policy interface as follows:
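The policy listing itself is not shown here. A self-contained sketch of the same shape follows; the pb types are replaced by minimal stand-ins, and the Run signature is an assumption:

```go
package main

import (
	"context"
	"fmt"
)

// Minimal stand-ins for the generated pb types.
type Job struct{ Name string }
type Block struct{ Jobs []*Job }
type WorkReq struct{ Blocks []*Block }

// Settings lists the Job types a workflow may use.
type Settings struct{ AllowedJobs []string }

func (s Settings) Validate() error { return nil }

func (s Settings) allowed(name string) bool {
	for _, n := range s.AllowedJobs {
		if n == name {
			return true
		}
	}
	return false
}

// Policy implements the restrictJobTypes check.
type Policy struct{}

// New is the Policy constructor.
func New() Policy { return Policy{} }

// Run rejects the WorkReq if any Job uses a type that Settings does not allow.
func (p Policy) Run(ctx context.Context, req *WorkReq, settings interface{}) error {
	s, ok := settings.(Settings)
	if !ok {
		return fmt.Errorf("settings were not restrictJobTypes Settings")
	}
	for _, block := range req.Blocks {
		for _, job := range block.Jobs {
			if !s.allowed(job.Name) {
				return fmt.Errorf("job type %q is not allowed", job.Name)
			}
		}
	}
	return nil
}

// check is a small helper so the policy can be exercised without a Context.
func check(req *WorkReq, s Settings) error {
	return New().Run(context.Background(), req, s)
}
```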
This code does the following:

- Defines our policy type, which implements the policy.Policy interface.
- Defines a New() constructor.
- Implements the policy.Policy.Run() method.
- Validates that the policy.Settings value passed is the Settings type for this Policy.
- Loops through all our req.Blocks and gets our Job instances.
- Checks that each Job has an allowed name.
We now have a policy we can apply to restrict Job types in a pb.WorkReq. This is how we could apply that in our configuration file to a workflow that does satellite disk erasures:
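The configuration entry itself is not shown in this excerpt. Assuming policy.json maps workflow names to lists of PolicyArgs, it might look like this (the exact key names are assumptions):

```json
{
  "SatelliteDiskErase": [
    {
      "Name": "restrictJobTypes",
      "Settings": {
        "AllowedJobs": [
          "validateDecom",
          "diskErase",
          "sleep",
          "getTokenFromBucket"
        ]
      }
    }
  ]
}
```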
This policy has the following attributes:
- It is applied only to workflows called "SatelliteDiskErase".
- It has a single policy applied, "restrictJobTypes", which we defined.
- It allows only Job types called one of the following: "validateDecom", "diskErase", "sleep", and "getTokenFromBucket".
This completes our restrictJobTypes Policy implementation.
Cautions on policy engines
Before we move on, we would like to provide a word of caution.
Simplicity is the key to sustainable software. We could define sustainable software as having the following attributes:
- It is easy to debug.
- Users can understand how to use it in a few hours at most.
Policy engines can be amazingly effective in preventing major problems, acting as a secondary check on sanity to some set of actions. As with security, it should provide substantial benefits while only introducing a small burden.
Policy engines are easy to overdevelop, with the lofty goal of 100% protection while introducing a large amount of complexity and burden. Often, we may see policy engines that are not tightly coupled to a single workflow system. Instead, engineers will design a generic system that tries to deal with multiple tooling systems.
If our policy statements start to look like a programming language (if statements, loops, functions), we are moving toward complexity. As policy engines become generic, they become complex to deal with. If we need policy enforcement in multiple places, this is another warning sign.
Not all workflows can achieve safety with generic policies. When we have a complex workflow, feel free to design a policy that does deep checks for a single workflow. We need to keep our if statements, loops, and functions in our code, not our configuration.
Sometimes engineers write lots of overcomplicated safety systems. Focus on providing guard rails that are easy to write and update while covering 80% of cases, not 100% of cases. With the division between software that creates a set of actions to run and a service that validates those actions against policies, we are unlikely to have a disk-erase type of event in the future, and importantly, we will be able to maintain velocity.
In this lesson, we discussed what an SOW is. To allow our workflow service to understand and enforce an SOW, we designed a policy engine and created our first policy, which can be applied to workflows submitted to our system.
Even with policies, something is going to go wrong. This could simply be a confluence of events that makes a normally safe operation unsafe.